{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "**原始方案：** https://www.kaggle.com/c/jpx-tokyo-stock-exchange-prediction/discussion/361482\n",
        "\n",
        "**方案解读**：  本次分享的方案非常简单，手动构建特征+lightgbm回归，最终得分0.356，获得了第二名。\n",
        "\n",
        "**代码作者**： aa\n",
        "\n",
        "**结构说明**：\n",
        "\n",
        "第一部分: 安装并导入依赖包;\n",
        "\n",
        "第二部分：加载处理数据;\n",
        "\n",
        "第三部分：训练模型，进行预测。\n",
        "\n",
        "\n",
        "# 第一部分: 安装并导入依赖包\n",
        "\n",
        "1. 安装依赖包"
      ],
      "metadata": {
        "id": "vXI03eew3ABh"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install numpy pandas scipy sklearn && pip install -U lightgbm"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ItgjZtMk2n6O",
        "outputId": "a060f460-f6e7-478c-a1ec-10b19893546c"
      },
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (1.21.6)\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.8/dist-packages (1.3.5)\n",
            "Requirement already satisfied: scipy in /usr/local/lib/python3.8/dist-packages (1.7.3)\n",
            "Requirement already satisfied: sklearn in /usr/local/lib/python3.8/dist-packages (0.0.post1)\n",
            "Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas) (2022.7.1)\n",
            "Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas) (2.8.2)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n",
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Requirement already satisfied: lightgbm in /usr/local/lib/python3.8/dist-packages (3.3.5)\n",
            "Requirement already satisfied: scikit-learn!=0.22.0 in /usr/local/lib/python3.8/dist-packages (from lightgbm) (1.0.2)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from lightgbm) (1.21.6)\n",
            "Requirement already satisfied: wheel in /usr/local/lib/python3.8/dist-packages (from lightgbm) (0.38.4)\n",
            "Requirement already satisfied: scipy in /usr/local/lib/python3.8/dist-packages (from lightgbm) (1.7.3)\n",
            "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn!=0.22.0->lightgbm) (3.1.0)\n",
            "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.8/dist-packages (from scikit-learn!=0.22.0->lightgbm) (1.2.0)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "2 导入依赖包,并固定随机数种子"
      ],
      "metadata": {
        "id": "iEKP7e3p3Vkk"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import math\n",
        "import os\n",
        "from scipy import stats\n",
        "import lightgbm as lgb\n",
        "from sklearn.metrics import mean_squared_error\n",
        "\n",
        "import warnings\n",
        "warnings.filterwarnings('ignore')\n",
        "\n",
        "def seed_everything(seed):\n",
        "    os.environ['PYTHONHASHSEED'] = str(seed)\n",
        "    np.random.seed(seed)\n",
        "\n",
        "SEED=42\n",
        "seed_everything(SEED)"
      ],
      "metadata": {
        "id": "-IWQzt3L3TLI"
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 第二部分: 加载数据，并进行特征生成\n",
        "\n",
        "3. 读取数据"
      ],
      "metadata": {
        "id": "CoMz3QC53fD7"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "train_data = pd.read_csv(\"./stock_prices.csv\",parse_dates=[\"Date\"])\n",
        "train_data = train_data.drop(columns=['RowId','ExpectedDividend','AdjustmentFactor','SupervisionFlag']).dropna().reset_index(drop=True)\n",
        "\n",
        "test_data = pd.read_csv(\"./stock_prices_test.csv\",parse_dates=[\"Date\"])\n",
        "test_data =test_data.drop(columns=['RowId','ExpectedDividend','AdjustmentFactor','SupervisionFlag'])\n"
      ],
      "metadata": {
        "id": "A_MeEqQr3eck"
      },
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "4. 生成新的特征，删除不用的特征，并对Nan值和Inf值进行处理"
      ],
      "metadata": {
        "id": "EnxuUYWR6JR7"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# 生成特征\n",
        "def add_features(feats):\n",
        "    # 20个工作日收盘价涨幅\n",
        "    feats[\"return_1month\"] = feats[\"Close\"].pct_change(20)\n",
        "    # 40个工作日收盘价涨幅\n",
        "    feats[\"return_2month\"] = feats[\"Close\"].pct_change(40)\n",
        "    # 60个工作日收盘价涨幅\n",
        "    feats[\"return_3month\"] = feats[\"Close\"].pct_change(60)\n",
        "    # 20个工作日收盘价波动率\n",
        "    feats[\"volatility_1month\"] = (\n",
        "        np.log(feats[\"Close\"]).diff().rolling(20).std()\n",
        "    )\n",
        "    # 40个工作日收盘价波动率\n",
        "    feats[\"volatility_2month\"] = (\n",
        "        np.log(feats[\"Close\"]).diff().rolling(40).std()\n",
        "    )\n",
        "    # 60个工作日收盘价波动率\n",
        "    feats[\"volatility_3month\"] = (\n",
        "        np.log(feats[\"Close\"]).diff().rolling(60).std()\n",
        "    )\n",
        "    # 收盘价除以20个工作日收盘价移动平均线\n",
        "    feats[\"MA_gap_1month\"] = feats[\"Close\"] / (\n",
        "        feats[\"Close\"].rolling(20).mean()\n",
        "    )\n",
        "    # 收盘价除以40个工作日收盘价移动平均线\n",
        "    feats[\"MA_gap_2month\"] = feats[\"Close\"] / (\n",
        "        feats[\"Close\"].rolling(40).mean()\n",
        "    )\n",
        "    # 收盘价除以60个工作日收盘价移动平均线\n",
        "    feats[\"MA_gap_3month\"] = feats[\"Close\"] / (\n",
        "        feats[\"Close\"].rolling(60).mean()\n",
        "    )\n",
        "    \n",
        "    return feats\n",
        "\n",
        "# 使用0代理Nan值和Inf值\n",
        "def fill_nan_inf(df):\n",
        "    df = df.fillna(0)\n",
        "    df = df.replace([np.inf, -np.inf], 0)\n",
        "    return df\n",
        "\n",
        "# 定义要使用的特征\n",
        "features =['High','Low','Open','Close','Volume', 'return_1month', 'return_2month', 'return_3month', 'volatility_1month', 'volatility_2month', 'volatility_3month', 'MA_gap_1month', 'MA_gap_2month', 'MA_gap_3month']\n",
        "\n",
        "\n",
        "train_data = add_features(train_data)\n",
        "test_data = add_features(test_data)\n",
        "train_data=fill_nan_inf(train_data)\n",
        "test_data=fill_nan_inf(test_data)"
      ],
      "metadata": {
        "id": "cXDn8nd03vTM"
      },
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 第三部分: 模型训练\n",
        "\n",
        "5. 定义指标计算函数"
      ],
      "metadata": {
        "id": "K3tf-O284uBY"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# 计算RMSE\n",
        "def feval_rmse(y_pred, lgb_train):\n",
        "    y_true = lgb_train.get_label()\n",
        "    return 'rmse', mean_squared_error(y_true, y_pred), False\n",
        "# 计算Pearsonr\n",
        "def feval_pearsonr(y_pred, lgb_train):\n",
        "    y_true = lgb_train.get_label()\n",
        "    return 'pearsonr', stats.pearsonr(y_true, y_pred)[0], True\n",
        "# 计算每天的收益\n",
        "def calc_spread_return_per_day(df, portfolio_size=200, toprank_weight_ratio=2):\n",
        "    assert df['Rank'].min() == 0\n",
        "    assert df['Rank'].max() == len(df['Rank']) - 1\n",
        "    weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)\n",
        "    purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()\n",
        "    short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()\n",
        "    return purchase - short\n",
        "# 计算每天的Sharpe\n",
        "def calc_spread_return_sharpe(df: pd.DataFrame, portfolio_size=200, toprank_weight_ratio=2):\n",
        "    buf = df.groupby('Date').apply(calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)\n",
        "    sharpe_ratio = buf.mean() / buf.std()\n",
        "    return sharpe_ratio#, buf\n",
        "# 按收益率对资产进行排序\n",
        "def add_rank(df):\n",
        "    df[\"Rank\"] = df.groupby(\"Date\")[\"Target\"].rank(ascending=False, method=\"first\") - 1 \n",
        "    df[\"Rank\"] = df[\"Rank\"].astype(\"int\")\n",
        "    return df\n",
        "# 计算得分\n",
        "def check_score(df,preds,Securities_filter=[]):\n",
        "    tmp_preds=df[['Date','SecuritiesCode']].copy()\n",
        "    tmp_preds['Target']=preds\n",
        "    \n",
        "    #Rank Filter. Calculate median for this date and assign this value to the list of Securities to filter.\n",
        "    tmp_preds['target_mean']=tmp_preds.groupby(\"Date\")[\"Target\"].transform('median')\n",
        "    tmp_preds.loc[tmp_preds['SecuritiesCode'].isin(Securities_filter),'Target']=tmp_preds['target_mean']\n",
        "    \n",
        "    tmp_preds = add_rank(tmp_preds)\n",
        "    df['Rank']=tmp_preds['Rank']\n",
        "    score=round(calc_spread_return_sharpe(df, portfolio_size= 200, toprank_weight_ratio= 2),5)\n",
        "    score_mean=round(df.groupby('Date').apply(calc_spread_return_per_day, 200, 2).mean(),5)\n",
        "    score_std=round(df.groupby('Date').apply(calc_spread_return_per_day, 200, 2).std(),5)\n",
        "    print(f'Competition_Score:{score}, rank_score_mean:{score_mean}, rank_score_std:{score_std}')"
      ],
      "metadata": {
        "id": "XStLdcyO4yNZ"
      },
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "6. 构建训练数据集和验证数据集"
      ],
      "metadata": {
        "id": "btpD98ZI6iWI"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "list_spred_h=list((train_data.groupby('SecuritiesCode')['Target'].max()-train_data.groupby('SecuritiesCode')['Target'].min()).sort_values()[:1000].index)\n",
        "list_spred_l=list((train_data.groupby('SecuritiesCode')['Target'].max()-train_data.groupby('SecuritiesCode')['Target'].min()).sort_values()[1000:].index)\n",
        "\n",
        "train_dataset = lgb.Dataset(train_data[train_data['SecuritiesCode'].isin(list_spred_h)][features],train_data[train_data['SecuritiesCode'].isin(list_spred_h)][\"Target\"],feature_name = features )\n",
        "validate_dataset = lgb.Dataset(train_data[train_data['SecuritiesCode'].isin(list_spred_l)][features], train_data[train_data['SecuritiesCode'].isin(list_spred_l)][\"Target\"],feature_name = features)"
      ],
      "metadata": {
        "id": "JLz4OLtE41nF"
      },
      "execution_count": 6,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "7. 训练模型"
      ],
      "metadata": {
        "id": "V2L7jsfJ6oS2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "params_lgb = {'learning_rate': 0.005,'metric':'None','objective': 'regression','boosting': 'gbdt','verbosity': 0,'n_jobs': -1,'force_col_wise':True}  \n",
        "model = lgb.train(params = params_lgb, \n",
        "                train_set = train_dataset, \n",
        "                valid_sets = [train_dataset, validate_dataset], \n",
        "                num_boost_round = 3000, \n",
        "                feval=feval_pearsonr,\n",
        "                callbacks=[ lgb.early_stopping(stopping_rounds=300, verbose=True), lgb.log_evaluation(period=100)])    "
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Z_zURkBh5BsD",
        "outputId": "b73ecde3-623b-45bf-dce0-85fb3db3f1c8"
      },
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Training until validation scores don't improve for 300 rounds\n",
            "[100]\ttraining's pearsonr: 0.0564218\tvalid_1's pearsonr: 0.0108141\n",
            "[200]\ttraining's pearsonr: 0.0680828\tvalid_1's pearsonr: 0.0133135\n",
            "[300]\ttraining's pearsonr: 0.0762279\tvalid_1's pearsonr: 0.0141654\n",
            "[400]\ttraining's pearsonr: 0.0823793\tvalid_1's pearsonr: 0.0145849\n",
            "[500]\ttraining's pearsonr: 0.0879873\tvalid_1's pearsonr: 0.0147099\n",
            "[600]\ttraining's pearsonr: 0.0934746\tvalid_1's pearsonr: 0.0149388\n",
            "[700]\ttraining's pearsonr: 0.0980855\tvalid_1's pearsonr: 0.0147947\n",
            "[800]\ttraining's pearsonr: 0.102637\tvalid_1's pearsonr: 0.0146655\n",
            "Early stopping, best iteration is:\n",
            "[568]\ttraining's pearsonr: 0.0918688\tvalid_1's pearsonr: 0.0150002\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "8. 在验证集上测试模型"
      ],
      "metadata": {
        "id": "hTzl6dZa7Yy9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "preds=model.predict(test_data[features])\n",
        "print(math.sqrt(mean_squared_error(preds,test_data.Target)))\n",
        "check_score(test_data,preds)\n",
        "check_score(test_data,preds,list_spred_h)\n",
        "check_score(test_data,preds,list_spred_l)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JvwNquU35HkT",
        "outputId": "34988aff-23a3-4a40-aa2c-f2d0a4094da9"
      },
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.023836449735498907\n",
            "Competition_Score:0.12494, rank_score_mean:0.10759, rank_score_std:0.86111\n",
            "Competition_Score:0.132, rank_score_mean:0.12325, rank_score_std:0.93367\n",
            "Competition_Score:0.21668, rank_score_mean:0.12284, rank_score_std:0.5669\n"
          ]
        }
      ]
    }
  ]
}